Near-optimal bounds for phase synchronization
The problem of phase synchronization is to estimate the phases (angles) of a
complex unit-modulus vector $z$ from their noisy pairwise relative measurements
$C = zz^* + \sigma W$, where $W$ is a complex-valued Gaussian random matrix.
The maximum likelihood estimator (MLE) is a solution to a unit-modulus
constrained quadratic programming problem, which is nonconvex. Existing works
have proposed polynomial-time algorithms such as a semidefinite relaxation
(SDP) approach or the generalized power method (GPM) to solve it. Numerical
experiments suggest both of these methods succeed with high probability for
$\sigma$ up to $\tilde{O}(\sqrt{n})$, yet existing analyses only confirm this
observation for $\sigma$ up to $O(n^{1/4})$. In this paper, we bridge the gap
by proving that the SDP is tight for $\sigma = O(\sqrt{n/\log n})$, and that
GPM converges to the global optimum under the same regime. Moreover, we
establish a linear convergence rate for GPM, and derive a tighter $\ell_\infty$
bound for the MLE. A novel technique we develop in this paper is to track
(theoretically) $n$ closely related sequences of iterates, in addition to the
sequence of iterates GPM actually produces. As a by-product, we obtain an
$\ell_\infty$ perturbation bound for leading eigenvectors. Our result also
confirms intuitions that use techniques from statistical mechanics.
Comment: 34 pages, 1 figure
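The generalized power method discussed above admits a very short implementation. The sketch below is only an illustration under assumed details -- spectral initialization followed by an entrywise phase projection of $Cx$ -- and is not claimed to be the exact variant analyzed in the paper.

    import numpy as np

    def generalized_power_method(C, n_iter=100):
        """Sketch of the generalized power method for phase synchronization.

        C is the n x n Hermitian measurement matrix C = z z^* + sigma * W.
        Starting from the leading eigenvector, each iterate is projected
        entrywise onto the complex unit circle."""
        x = np.linalg.eigh(C)[1][:, -1]               # spectral initialization
        x = x / np.maximum(np.abs(x), 1e-12)
        for _ in range(n_iter):
            y = C @ x
            x = y / np.maximum(np.abs(y), 1e-12)      # entrywise phase projection
        return x

    # Usage on a synthetic instance C = z z^* + sigma * W.
    rng = np.random.default_rng(0)
    n, sigma = 500, 2.0
    z = np.exp(1j * rng.uniform(0, 2 * np.pi, size=n))
    W = (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))) / 2
    W = W + W.conj().T                                # Hermitian Gaussian noise
    C = np.outer(z, z.conj()) + sigma * W
    z_hat = generalized_power_method(C)
    print("correlation |<z_hat, z>| / n =", np.abs(np.vdot(z_hat, z)) / n)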
The Interpolation Phase Transition in Neural Networks: Memorization and Generalization under Lazy Training
Modern neural networks are often operated in a strongly overparametrized
regime: they comprise so many parameters that they can interpolate the training
set, even if actual labels are replaced by purely random ones. Despite this,
they achieve good prediction error on unseen data: interpolating the training
set does not induce overfitting. Further, overparametrization appears to be
beneficial in that it simplifies the optimization landscape. Here we study
these phenomena in the context of two-layers neural networks in the neural
tangent (NT) regime. We consider a simple data model, with isotropic feature
vectors in $d$ dimensions, and $N$ hidden neurons. Under the assumption
$N \le d^{C}$ (for a constant $C$), we show that the network can exactly
interpolate the data as soon as the number of parameters is significantly
larger than the number of samples: $Nd \gg n$. Under these assumptions, we show that the
empirical NT kernel has minimum eigenvalue bounded away from zero, and
characterize the generalization error of min-$\ell_2$ norm interpolants, when
the target function is linear. In particular, we show that the network
approximately performs ridge regression in the raw features, with a strictly
positive `self-induced' regularization.
Comment: 69 pages, 4 figures
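As a concrete illustration of the objects in this abstract, the snippet below builds the empirical NT kernel of a two-layer ReLU network (gradients taken with respect to the first-layer weights only), checks its minimum eigenvalue, and forms the min-$\ell_2$ norm interpolant in the NT feature space. The activation, scalings, and problem sizes are illustrative assumptions, not the paper's exact setup.

    import numpy as np

    def nt_features(X, W, a):
        """NT feature map of f(x) = sum_j a_j * relu(<w_j, x>) with respect to
        the first-layer weights: phi(x) = concat_j a_j * relu'(<w_j, x>) * x."""
        act = (X @ W.T > 0).astype(float)             # relu'(<w_j, x_i>), (n, N)
        return (act * a[None, :])[:, :, None] * X[:, None, :]   # (n, N, d)

    rng = np.random.default_rng(0)
    n, d, N = 200, 50, 100                            # number of parameters Nd >> n
    X = rng.standard_normal((n, d)) / np.sqrt(d)      # isotropic feature vectors
    W = rng.standard_normal((N, d))
    a = rng.choice([-1.0, 1.0], size=N) / np.sqrt(N)

    Phi = nt_features(X, W, a).reshape(n, N * d)
    K = Phi @ Phi.T                                   # empirical NT kernel, (n, n)
    print("lambda_min(K) =", np.linalg.eigvalsh(K)[0])

    # Min-l2-norm interpolation of a linear target in the NT feature space.
    y = X @ rng.standard_normal(d)
    theta = Phi.T @ np.linalg.solve(K, y)             # theta = Phi^T (Phi Phi^T)^{-1} y
    print("max train residual:", np.max(np.abs(Phi @ theta - y)))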
Differentially Private Data Releasing for Smooth Queries with Synthetic Database Output
We consider accurately answering smooth queries while preserving differential
privacy. A query is said to be $K$-smooth if it is specified by a function
defined on $[-1,1]^d$ whose partial derivatives up to order $K$ are all
bounded. We develop an $\epsilon$-differentially private mechanism for the
class of $K$-smooth queries. The major advantage of the algorithm is that it
outputs a synthetic database. In real applications, a synthetic database output
is appealing. Our mechanism achieves an accuracy that decays polynomially in
the database size, and runs in polynomial time. We also
generalize the mechanism to preserve $(\epsilon, \delta)$-differential privacy
with slightly improved accuracy. Extensive experiments on benchmark datasets
demonstrate that the mechanisms have good accuracy and are efficient.
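As background only (this is not the paper's synthetic-database mechanism), the simplest way to answer a single numerical query under $\epsilon$-differential privacy is the Laplace mechanism; the query and sensitivity below are illustrative assumptions.

    import numpy as np

    def laplace_mechanism(db, query, sensitivity, epsilon, rng=None):
        """Generic Laplace mechanism: answer query(db) with noise calibrated to
        the query's global sensitivity, giving epsilon-differential privacy."""
        rng = np.random.default_rng() if rng is None else rng
        return query(db) + rng.laplace(scale=sensitivity / epsilon)

    # Example: a mean query over records clipped to [-1, 1]; changing one of
    # n records changes the mean by at most 2/n, so the sensitivity is 2/n.
    db = np.clip(np.random.default_rng(1).standard_normal(1000), -1, 1)
    noisy_mean = laplace_mechanism(db, np.mean, sensitivity=2 / len(db), epsilon=0.5)
    print(noisy_mean)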
Unraveling Projection Heads in Contrastive Learning: Insights from Expansion and Shrinkage
We investigate the role of projection heads, also known as projectors, within
the encoder-projector framework (e.g., SimCLR) used in contrastive learning. We
aim to demystify the observed phenomenon where representations learned before
projectors outperform those learned after them -- measured using downstream
linear classification accuracy -- even when the projectors themselves are linear.
In this paper, we make two significant contributions towards this aim.
Firstly, through empirical and theoretical analysis, we identify two crucial
effects -- expansion and shrinkage -- induced by the contrastive loss on the
projectors. In essence, contrastive loss either expands or shrinks the signal
direction in the representations learned by an encoder, depending on factors
such as the augmentation strength, the temperature used in contrastive loss,
etc. Secondly, drawing inspiration from the expansion and shrinkage phenomenon,
we propose a family of linear transformations to accurately model the
projector's behavior. This enables us to precisely characterize the downstream
linear classification accuracy in the high-dimensional asymptotic limit. Our
findings reveal that linear projectors operating in the shrinkage (or
expansion) regime hinder (or improve) the downstream classification accuracy.
This provides the first theoretical explanation as to why (linear) projectors
impact the downstream performance of learned representations. Our theoretical
findings are further corroborated by extensive experiments on both synthetic
data and real image data.
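A minimal synthetic experiment in the spirit of the shrinkage/expansion effect described above (the data model, the ridge-regularized linear probe, and the scaling factors are all illustrative assumptions, not the paper's model): a linear "projector" rescales the signal direction of the representations, and the downstream probe's test accuracy is compared under shrinkage and expansion.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n_train, n_test, mu = 200, 300, 2000, 1.0

    def sample(n):
        # Signal lives along the first coordinate; the rest is isotropic noise.
        y = rng.choice([-1.0, 1.0], size=n)
        X = rng.standard_normal((n, d))
        X[:, 0] += mu * y
        return X, y

    def ridge_probe_accuracy(alpha, lam=300.0):
        # Linear "projector": rescale the signal direction by alpha.
        D = np.ones(d)
        D[0] = alpha
        Xtr, ytr = sample(n_train)
        Xte, yte = sample(n_test)
        Xtr, Xte = Xtr * D, Xte * D
        # Ridge-regularized linear probe trained on the projected representations.
        w = np.linalg.solve(Xtr.T @ Xtr + lam * np.eye(d), Xtr.T @ ytr)
        return np.mean(np.sign(Xte @ w) == yte)

    for alpha in [0.1, 1.0, 10.0]:   # shrinkage, identity, expansion
        print(f"alpha={alpha:5.1f}  test accuracy={ridge_probe_accuracy(alpha):.3f}")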
Tractability from overparametrization: The example of the negative perceptron
In the negative perceptron problem we are given $n$ data points $(x_i, y_i)$,
$i \le n$, where $x_i$ is a $d$-dimensional vector and $y_i \in \{+1, -1\}$ is
a binary label. The data are not linearly separable and hence we content
ourselves with finding a linear classifier with the largest possible
\emph{negative} margin. In other words, we want to find a unit norm vector
$\theta$ that maximizes $\min_{i \le n} y_i \langle \theta, x_i \rangle$. This
is a non-convex
optimization problem (it is equivalent to finding a maximum norm vector in a
polytope), and we study its typical properties under two random models for the
data.
We consider the proportional asymptotics in which $n, d \to \infty$ with
$n/d \to \delta$, and prove upper and lower bounds on the maximum margin
$\kappa_{\mathrm{s}}(\delta)$ or -- equivalently -- on its inverse function
$\delta_{\mathrm{s}}(\kappa)$. In other words, $\delta_{\mathrm{s}}(\kappa)$ is
the overparametrization threshold: for $n/d \le \delta_{\mathrm{s}}(\kappa) -
\varepsilon$ a classifier achieving vanishing training error exists with high
probability, while for $n/d \ge \delta_{\mathrm{s}}(\kappa) + \varepsilon$ it
does not. Our bounds on $\delta_{\mathrm{s}}(\kappa)$ match to the leading
order as $\kappa \to -\infty$.
We then analyze a linear programming algorithm to find a solution, and
characterize the corresponding threshold $\delta_{\mathrm{lin}}(\kappa)$. We
observe a gap between the interpolation threshold $\delta_{\mathrm{s}}(\kappa)$
and the linear programming threshold $\delta_{\mathrm{lin}}(\kappa)$, raising
the question of the behavior of other algorithms.
Comment: 88 pages; 7 pdf figures
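The nonconvex max-margin problem above can be attacked with simple convex heuristics. The sketch below sets up a toy non-separable instance and solves one such linear program (maximize the margin under a linear normalization constraint that rules out $\theta = 0$, then rescale to unit norm); this LP is an illustrative stand-in and is not necessarily the linear programming algorithm analyzed in the paper.

    import numpy as np
    from scipy.optimize import linprog

    # Toy negative-perceptron instance: with n/d well above the interpolation
    # threshold, random labels are not linearly separable and the best margin
    # is negative.
    rng = np.random.default_rng(0)
    n, d = 500, 100
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    y = rng.choice([-1.0, 1.0], size=n)
    Z = y[:, None] * X                                # rows z_i = y_i * x_i

    # LP heuristic: maximize t subject to <theta, z_i> >= t for all i, with
    # the normalization <theta, v> = 1 (v = mean of the z_i) to exclude theta = 0.
    v = Z.mean(axis=0)
    c = np.concatenate([np.zeros(d), [-1.0]])         # minimize -t
    A_ub = np.hstack([-Z, np.ones((n, 1))])           # -<theta, z_i> + t <= 0
    b_ub = np.zeros(n)
    A_eq = np.concatenate([v, [0.0]])[None, :]
    b_eq = np.array([1.0])
    bounds = [(-10.0, 10.0)] * d + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)

    theta = res.x[:d] / np.linalg.norm(res.x[:d])     # rescale to unit norm
    print("achieved (negative) margin:", np.min(Z @ theta))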